Tree reconstruction can be done with RAxML or Phylobayes. This section will cover one example in which we will build a supermatrix tree using RAxML, and a single gene tree using Phylobayes.



In [1]:

    
from reprophylo import *
pj = unpickle_pj('outputs/my_project.pkpj', git=False)









    



DEBUG:Cloud:Log file (/home/amir/.picloud/cloud.log) opened

3.9.1 Using RAxML

RAxML is configured with the RaxmlConf object. This object provides control over the following settings:

method_name: The method name.
program_name & cmd:. RAxML binaries exist in several versions. If you are using the Docker container you can leave this as is. The versions vary in the number of threads they utilized (PTHREADS or not), and the architecture they are optimized for (AVX or SSE3). raxmlHPC-PTHREADS-SSE3 is the default here, both as the program name and as the cmd. If you do not want to use multiple threads, you have to specify the name and command of the non PTHREADS binary, ie, raxmlHPC.
keepfiles: Whether or not to keep the output files in the working directory (the tree is stored in the Project)
preset: The RAxML algorithm. RaxmlConf has several preset algorithms:
- 'fa' - will run a single ML search with rapid bootstrap
- 'fD_fb' - will run a single ML tree with relBootstrap (quick and least accurate supports calculation)
- 'fd_b_fb' - will run one or more ML trees with thorough bootstrap (slow and accurate)
- 'fF_fJ' - will run a fast ML tree with sh-like supports (quick and dirty)
- 'fd_fJ' - will run one or more (proper) ML tree(s) with sh-like supports (quick supports calculation).
alns: Alignments to analyze. all by default. It can be modified by passing a list of trimmed alignment names and/or concatenation names.
model: The model of rate heterogeneity. for example, GAMMA (parametric) or CAT (non-parametric). The CAT model is a non parametric approach to categories the rate variation without calculating the GAMMA distribution, as a fast approximation. It is different than the CAT model in Phylobayes, where the number of parameters increases by categorizing the data to subsets, which differ in their substitution matrices and rate variation categories. The CAT in RAxML is "quick and dirty". The CAT in Phylobayes is "slow and accurate."
matrix: The protein substitution matrix. This parameter is only relevant to protein datasets, and it is ignored for DNA only datasets. By default it is set to 'JTT'. If it is a concatenated analysis, the string specified here will be set as the substitution matrix of each of the protein partitions. However, it is possible to pass a dictionary, instead of a string, containing the locus names as keys, and the name of substitution matrix assigned to each of them as values. Also important, partition information is taken into account automatically. No need to make a partition file.
threads: The number of threads to use. Using PTHREADS, threads=1 is automatically changed to 2. Using the non PTHREADS version, the threads number is set to one, regardless of the value the user passes.
cline_args: Other command line arguments, most importantly, the argument '-N' should be used to determine the number of ML searches (it is 1 by default and it doesn't work with fa or fF_fJ), and '-#' should be used to set the number of bootstrap replicates (it is 100 by default and it only works with fD_fb and fd_b_fb). -N and -# are not synonyms. This is different from the RAxML command line.

The RAxML manual is an important read, in order to understand all the analysis modifiers that can be passed, and to become familiar with the full range of models and substitution matrices available.

In this example, comments which specify item numbers, refer to the list just above. It will configure a concatenated analysis of the supermatrix 'large_concat', with the GTR GAMMA model for all the partitions, utilizing two threads and with two ML searches. Branch supports will be derived from a relBootstrap analysis.



In [2]:

    
raxml = RaxmlConf(pj,                                    # The Project 
                  method_name='supermatrix',             # Any string
                  program_name='raxmlHPC-PTHREADS-SSE3', # item 2
                  keepfiles=False,                       # False is default  
                  cmd='default',                         # item 2  
                  preset='fD_fb',                        # item 4
                  alns=['large_concat'],                 # item 5
                  model='GAMMA',                         # item 6
                  matrix='JTT',                          # item 7
                  threads=4,                             # item 8
                  cline_args={'-N': 2}                   # item 9
                  )









    



raxmlHPC-PTHREADS-SSE3 -f D -m PROTGAMMAJTT -n 221101440759352.8_large_concat0 -q 221101440759352.8_large_concat_partfile -p 603 -s 221101440759352.8_large_concat.fasta -T 4 -N 2
raxmlHPC-PTHREADS-SSE3 -f b -m PROTGAMMAJTT -n 221101440759352.8_large_concat1 -q 221101440759352.8_large_concat_partfile -p 369 -s 221101440759352.8_large_concat.fasta -t RAxML_bestTree.221101440759352.8_large_concat0 -T 4 -z RAxML_rellBootstrap.221101440759352.8_large_concat0

3.9.2 Using Phylobayes

In this example, a PbConf object is set to analyse a single trimmed alignment. The cline_args here are horrible and set this way for speed. The default settings, however, are sensible. Still, read the manual, at least the bits about nchain, burn-in and the proper usage of the GTR and/or CAT models (and others).



In [3]:

    
phylo = PbConf(pj,                                  # Default setting:
               method_name='single_gene',           # 'dna_cat_gtr'          
               program_name='phylobayes',
               keepfiles=False,                     # True
               cmd='default',
               alns=['28s@muscleDefault@gt50'],     # 'all'
               cline_args={'gtr': True,             
                           'cat': True,
                           'nchain': '2 50 0.9 5',  # '2 100 0.1 100'
                           'b': '1'                 # '5'
                           }
               )









    



pb -d 262531440759355.16_28s@muscleDefault@gt50.phylip -gtr -nchain 2 50 0.9 5 -b 1 -cat 262531440759355.16_28s@muscleDefault@gt50

3.9.3 Executing the tree reconstructions and accessing trees

This is done using the tree Project method:



In [4]:

    
pj.tree([raxml, phylo])

The resulting trees are placed in the pj.trees dictionary, with keys of the form 'locus_name@aln_method@trim_method@tree_method'. For trees from supermatrices the key is 'concat_name@mixed@mixed@tree_method'. The values in this dictionary are lists, each with two values. The first in an ETE Tree object, and the second is an NHX string representation of the tree.



In [5]:

    
pj.trees.keys()









    Out[5]:





['large_concat@mixed@mixed@supermatrix', '28s@muscleDefault@gt50@single_gene']

And as for alignment and trimming, we can review the approaches that we used:



In [6]:

    
print pj.used_methods['single_gene']









    



PbConf named single_gene with ID 262531440759355.16
Alignments: 28s@muscleDefault@gt50 
Created on: Fri Aug 28 11:55:55 2015
Commands:
28s@muscleDefault@gt50: ['pb -d 262531440759355.16_28s@muscleDefault@gt50.phylip -gtr -nchain 2 50 0.9 5 -b 1 -cat 262531440759355.16_28s@muscleDefault@gt50']

Environment:
Platform: Linux-3.13.0-40-generic-x86_64-with-Ubuntu-14.04-trusty
 Processor: x86_64
 Python build: defaultJun 22 2015 17:58:13
 Python compiler: GCC 4.8.2
 Python implementation: CPython
 Python version: 2.7.6
 ete2 version: 2.2rev1056
 biopython version: 1.64
 dendropy version: 3.12.0
 cloud version: 2.8.5
 reprophylo version 1.0
 User: amir-TECRA-W50-A
 Program and version: phylobayes version 3.3f
 Program reference: N. Lartillot
 T. Lepage and S. Blanquart
 2009: PhyloBayes 3: a Bayesian software package for phylogenetic reconstruction and molecular dating. Bioinformatics Vol. 25 no. 17.
execution time:
180.445002794

==============================
Core Methods section sentence:
==============================
Phylogenetic trees were reconstructed from the dataset(s) 28s@muscleDefault@gt50 using the program phylobayes version 3.3f [1].

Reference:
N. Lartillot, T. Lepage and S. Blanquart, 2009: PhyloBayes 3: a Bayesian software package for phylogenetic reconstruction and molecular dating. Bioinformatics Vol. 25 no. 17.

Tree objects can be fetched easily and manipulated with ETE tricks, using the ft Project method.



In [7]:

    
t = pj.ft('28s@muscleDefault@gt50@single_gene').convert_to_ultrametric(10)









    



returning tree object 28s@muscleDefault@gt50@single_gene

or written to a file in a suitable format



In [8]:

    
pj.ft('28s@muscleDefault@gt50@single_gene').write(features=['source_organism'], format=5, outfile="new_tree.nw")









    



returning tree object 28s@muscleDefault@gt50@single_gene



In [10]:

    
pickle_pj(pj, 'outputs/my_project.pkpj')









    Out[10]:





'outputs/my_project.pkpj'

3.9.4 Quick reference



In [ ]:

    
# Configure a raxml analysis
raxml = RaxmlConf(pj, **kwargs)

# Configure a phylobayes analysis
phylo = PbConf(pj, **kwargs)

# Execute tree reconstruction
pj.tree([list_of_RaxmlConf_and_or_PbConf_objects])

# Fetch an ETE Tree object
t = pj.ft('locus_name@aln_name@trim_name@tree_name')

# Write newick file
t.write(format=5, outfile="filename.nw")

# Write NHX format with all the qualifiers
t.write(features=[], format=5, outfile="filename.nw")